feat(i18n): add Traditional + Simplified Chinese entity detection #945

Merged
igorls merged 2 commits into MemPalace:develop from lmanchu:feat/zh-entity-detection
Apr 21, 2026

@lmanchu lmanchu commented Apr 16, 2026

Problem

zh-TW and zh-CN are shipped in mempalace/i18n/ but have no entity section. When a Chinese user runs:

detect_entities(paths, languages=("zh-TW",))

get_entity_patterns() silently falls back to English (i18n/__init__.py:231-233), so the English candidate pattern [A-Z][a-z]{1,19} is applied to Chinese text. Result: no Chinese names are extracted; at best, only Latin-script names embedded in the Chinese document match. ja and ko share the same bug (to be fixed in follow-up PRs).

Reproduction (before this PR)

from mempalace.entity_detector import extract_candidates

zh_text = "朱宜振 主持會議。朱宜振 同意 Jeffrey 的方案。朱宜振: 決定 ship。"
extract_candidates(zh_text, languages=("zh-TW",))
# → {}                    ← no Chinese names
extract_candidates(zh_text, languages=("zh-TW", "en"))
# → {"Jeffrey": 1}        ← only English name, misses 朱宜振 entirely
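The effect of the fallback can be illustrated without mempalace, using only the English candidate pattern quoted above. This is a minimal sketch: the `\b` wrap and the helper name `english_candidates` are assumptions for illustration, not the library's code, and the real pipeline applies further filtering that is not modelled here.

```python
import re

# English candidate pattern quoted in the problem description.
# The surrounding \b wrap is an assumption about how the framework applies it.
ENGLISH_CANDIDATE = re.compile(r"\b[A-Z][a-z]{1,19}\b")


def english_candidates(text: str) -> list[str]:
    """Return every substring matching the English candidate pattern."""
    return ENGLISH_CANDIDATE.findall(text)


zh_text = "朱宜振 主持會議。朱宜振 同意 Jeffrey 的方案。朱宜振: 決定 ship。"
print(english_candidates(zh_text))  # ['Jeffrey']
```

No CJK character can satisfy `[A-Z]`, so a pattern of this shape can never produce 朱宜振, however often the name appears.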

Approach

Add entity sections to zh-TW.json and zh-CN.json that work within the current framework's constraints:

  • candidate_pattern: common-surname-prefixed CJK n-grams. ~100 surnames covering >95% of Taiwanese and PRC names. The part after the surname is capped at {1,2} characters so greedy matching doesn't swallow the trailing verb (e.g. extracting 朱宜振說 as an entity from 朱宜振說 would be wrong).
  • boundary_chars: \u4E00-\u9FFF. This reuses the script-aware \b infrastructure from #932 (fix(entity_detector): script-aware word boundaries for combining-mark scripts). Applied to CJK, \b fires at CJK↔non-CJK transitions, the same mechanism Devanagari uses.
  • person_verb_patterns: Chinese verbs attach directly to the name with no whitespace, so patterns are written as {name}說, {name}問, {name}決定 — no \b or \s+ between them.
  • dialogue_patterns: full-width colon :, Chinese quotes 「」『』, plus the standard Latin forms.
  • pronoun_patterns: 他 / 她 / 它 / 他們 / 她們 / 您 / 咱.
  • stopwords: ~140 entries — particles, pronouns, time expressions, question words, conjunctions, UI nouns, politeness forms.
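The candidate and person-verb pieces described above can be sketched in plain `re`. Everything here is illustrative: the six-surname subset, the lookaround approximation of the script-aware `\b`, and the helper names `extract_names` and `count_person_verbs` are assumptions, not the contents of zh-TW.json or the loader's actual expansion.

```python
import re

CJK = r"\u4E00-\u9FFF"      # boundary_chars range named in the PR
SURNAMES = "朱陳林王李張"    # hypothetical subset; the PR ships ~100 surnames

# Surname plus 1-2 trailing CJK characters, with no CJK neighbour on either
# side (an assumed approximation of the script-aware \b wrap).
CANDIDATE = re.compile(rf"(?<![{CJK}])[{SURNAMES}][{CJK}]{{1,2}}(?![{CJK}])")

# Verb templates attach directly to the name: no \b or \s+ in between.
PERSON_VERB_TEMPLATES = ["{name}說", "{name}問", "{name}決定"]


def extract_names(text: str) -> list[str]:
    """Return surname-prefixed candidates that sit at a CJK/non-CJK boundary."""
    return CANDIDATE.findall(text)


def count_person_verbs(name: str, text: str) -> int:
    """Count how often the name is directly followed by a person verb."""
    total = 0
    for template in PERSON_VERB_TEMPLATES:
        pattern = template.format(name=re.escape(name))
        total += len(re.findall(pattern, text))
    return total


print(extract_names("- 朱宜振 主持\n王小明: 「好」"))   # ['朱宜振', '王小明']
print(count_person_verbs("朱宜振", "朱宜振說可以。朱宜振決定 ship。"))  # 2
```

In this sketch, `extract_names("朱宜振說 ")` returns nothing: with the `{1,2}` cap the four-character run cannot match as a whole, whereas a `{1,3}` cap would have produced the wrong entity 朱宜振說.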

What you get

# After this PR
zh_text = (
    "# 會議紀錄\n"
    "- 朱宜振 主持\n"
    "- Jeffrey Lai 報告融資\n"
    "朱宜振 跟 Jeffrey 討論 pitch。\n"
    "朱宜振: 「我們要 6 月 launch。」\n"
    "朱宜振 同意 Arnold 的方案。\n"
    "朱宜振 決定 ship pitch。\n"
    # ...8 more mentions...
)
detect_entities(..., languages=("zh-TW", "en"))
# people:    [('朱宜振', 0.99)]       ← correctly classified as person
# uncertain: [('Jeffrey Lai', 0.06), ...]

Known Limitation (documented in tests)

CJK scripts have no word delimiters. A name flanked by CJK on both sides with no punctuation or whitespace break is not extracted — the framework's \b(...)\b wrap can't fire between two CJK characters without a dictionary tokeniser. A test covers this adversarial case explicitly (test_zh_tw_known_limitation_inline_name_no_boundary).

In practice this rarely degrades recall: realistic Chinese technical writing has many non-CJK neighbours (bullet lines, inline English, full-width punctuation, newlines), so names that appear 3+ times across a document almost always land at a matchable boundary somewhere. Verified on a realistic zh-TW PKM note: 朱宜振 appearing in 8 sentences was extracted 11x with 0.99 person-classification confidence.
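The limitation and the recall argument can be made concrete with a small sketch. It assumes the same lookaround approximation of the script-aware boundary and hard-codes a single surname; `matchable_mentions` and both sample texts are hypothetical and are not taken from the test suite.

```python
import re

CJK = r"\u4E00-\u9FFF"
# One hypothetical surname for brevity; the boundary handling is an assumed
# approximation of the framework's \b(...)\b wrap.
CANDIDATE = re.compile(rf"(?<![{CJK}])朱[{CJK}]{{1,2}}(?![{CJK}])")


def matchable_mentions(text: str, name: str) -> tuple[int, int]:
    """Return (total mentions of name, mentions at a matchable boundary)."""
    total = text.count(name)
    matched = sum(1 for m in CANDIDATE.finditer(text) if m.group() == name)
    return total, matched


# Name flanked by CJK on both sides: never extracted.
inline_only = "昨天朱宜振說明了計畫"
# Same sentence plus a bullet line and a dialogue line.
note = "昨天朱宜振說明了計畫\n- 朱宜振 主持\n朱宜振: 「六月 launch」\n會後朱宜振離開"

print(matchable_mentions(inline_only, "朱宜振"))  # (1, 0)
print(matchable_mentions(note, "朱宜振"))         # (4, 2)
```

Two of the four mentions in `note` are missed, but the bullet line and the dialogue line are enough for the name to surface as a candidate.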

Testing

  • 7 new tests in tests/test_entity_detector.py:
    • test_zh_tw_candidate_extraction_at_boundaries
    • test_zh_tw_person_classification
    • test_zh_tw_stopwords_filter_common_particles
    • test_zh_tw_falls_back_to_english_for_non_cjk_names
    • test_zh_cn_candidate_extraction
    • test_zh_cn_and_zh_tw_union_covers_both_variants
    • test_zh_tw_known_limitation_inline_name_no_boundary
  • Full suite: 957 passed, 0 failed (pytest tests/ -q).
  • Ruff clean (ruff check mempalace/i18n/ tests/test_entity_detector.py).

Follow-ups (separate PRs)

  • ja.json: same treatment (currently falls back to English).
  • ko.json: same treatment.

Checklist

  • Tests pass (pytest tests/ -v)
  • No hardcoded paths
  • Linter passes (ruff check)
  • No new dependencies
  • Targets develop per CONTRIBUTING.md

zh-TW and zh-CN previously had no `entity` section. Calling
`detect_entities(..., languages=("zh-TW",))` silently fell back to
English patterns (i18n/__init__.py:231-233), so no Chinese names
were ever extracted — Chinese-speaking users got zero people or
projects detected from their own notes.

This adds entity sections for both locales:

- `candidate_pattern`: common-surname-prefixed CJK n-grams (~100
  surnames covering >95% of Taiwanese / PRC names), length capped
  at {1,2} trailing chars so greedy matches don't swallow the
  trailing verb character (e.g. 朱宜振說).
- `boundary_chars`: `\u4E00-\u9FFF` so the i18n loader's
  script-aware wrap (introduced in MemPalace#932) fires `\b` at CJK↔non-CJK
  transitions. This is the same mechanism used for Devanagari,
  applied to the CJK range.
- `person_verb_patterns`: Chinese verbs attach directly to the
  name with no whitespace, so patterns are written as `{name}說`,
  `{name}問`, `{name}決定` — no `\b` or `\s+` separators.
- `dialogue_patterns`: full-width colon `:`, Chinese quotes
  「」『』, plus the standard Latin forms.
- `pronoun_patterns`: 他 / 她 / 它 / 他們 / 她們 / 您 / 咱.
- `stopwords`: ~140 common particles, pronouns, time expressions,
  question words, conjunctions, UI nouns, and politeness forms.

**Known limitation** (explicitly covered by a test): CJK scripts
have no word delimiters, so a name flanked by CJK on both sides
with no punctuation or whitespace break is not extracted. This
is a fundamental limit of regex-based CJK entity detection —
resolving it would require a dictionary tokeniser. Realistic
Chinese technical writing contains enough non-CJK neighbours
(bullet lines, inline English, full-width punctuation, newlines)
that 3+ occurrences normally produce matches. Verified against a
realistic zh-TW PKM note: 朱宜振 extracted 11x from 8 sentences
with 0.99 person-classification confidence.

**Follow-ups** (separate PRs): same pattern for `ja` and `ko`,
both of which currently share the silent fallback-to-English bug.

Tests: 7 new tests in `tests/test_entity_detector.py`:
- `test_zh_tw_candidate_extraction_at_boundaries`
- `test_zh_tw_person_classification`
- `test_zh_tw_stopwords_filter_common_particles`
- `test_zh_tw_falls_back_to_english_for_non_cjk_names`
- `test_zh_cn_candidate_extraction`
- `test_zh_cn_and_zh_tw_union_covers_both_variants`
- `test_zh_tw_known_limitation_inline_name_no_boundary`

Full suite: 957 passed, 0 failed.

Collapse implicit string concatenation to single-line strings
to satisfy `ruff format --check` in CI.

Co-Authored-By: Claude <noreply@anthropic.com>

igorls commented Apr 18, 2026

I think the new ASCII command-style project patterns in zh-TW.json / zh-CN.json are being neutralized by the script-boundary expansion.

Specifically, patterns like \bimport\s+{name}\b and \bpip\s+install\s+{name}\b go through _expand_b(...) whenever boundary_chars is set for the locale. For Chinese, that rewrites \b into a CJK-transition boundary rather than normal Python word-boundary semantics.

As a result, the expanded regex no longer matches plain ASCII text at all. So these project signals are effectively dead code in the current implementation.

I’d suggest either:

  • removing \b from those two patterns for the Chinese locales, or
  • avoiding boundary expansion for explicitly ASCII-oriented patterns like import/pip commands.

Everything else in the PR looked consistent to me, but I don’t think these two patterns currently do what the PR intends.
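The concern can be reproduced in isolation under one assumption about the expansion: that `\b` is rewritten to a boundary existing only at CJK↔non-CJK transitions. The `expand_b` below is a hypothetical stand-in for `_expand_b(...)`, not its actual implementation, and `build` is an illustrative helper.

```python
import re

CJK = r"\u4E00-\u9FFF"
# Assumed semantics: a boundary exists only where CJK meets non-CJK.
CJK_BOUNDARY = rf"(?:(?<![{CJK}])(?=[{CJK}])|(?<=[{CJK}])(?![{CJK}]))"


def expand_b(pattern: str) -> str:
    """Replace every \\b in the pattern with the CJK-transition boundary."""
    return pattern.replace(r"\b", CJK_BOUNDARY)


def build(template: str, name: str, expand: bool) -> re.Pattern:
    """Fill in {name}, then optionally apply the boundary expansion."""
    source = template.replace("{name}", re.escape(name))
    return re.compile(expand_b(source) if expand else source)


template = r"\bimport\s+{name}\b"
line = "import numpy"

print(bool(build(template, "numpy", expand=False).search(line)))  # True
print(bool(build(template, "numpy", expand=True).search(line)))   # False
```

Under these assumed semantics the expanded pattern only matches when the command is directly flanked by CJK characters (for example `用import numpy吧`), which is not how such commands normally appear. Dropping `\b` from the template, the first option above, makes the pattern match in both modes.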

@igorls igorls merged commit 2a5914b into MemPalace:develop Apr 21, 2026
6 checks passed
jphein pushed a commit to jphein/mempalace that referenced this pull request Apr 24, 2026
Restore-integrity release. Unbreaks fresh `pip install mempalace` from
v3.3.2 by re-tagging current develop, which carries both the plugin.json
consumer (shipped in 3.3.2) and the matching mempalace-mcp entry point
in pyproject.toml (added on develop ~10h after the 3.3.2 tag via MemPalace#340
by @messelink). MemPalace#1093 diagnosed by @jphein.

Bumps (all 5 sources agree per Version Guard / CLAUDE.md):
- mempalace/version.py              3.3.2 → 3.3.3
- pyproject.toml                     3.3.2 → 3.3.3
- .claude-plugin/plugin.json         3.3.2 → 3.3.3
- .claude-plugin/marketplace.json    3.3.2 → 3.3.3
- .codex-plugin/plugin.json          3.3.2 → 3.3.3
- CHANGELOG.md                        new [3.3.3] entry

No code changes. The fix for MemPalace#1093 is already on develop via merged PRs
MemPalace#340, MemPalace#1021, MemPalace#851, MemPalace#942, MemPalace#833, MemPalace#673, MemPalace#661, MemPalace#659, MemPalace#1097, MemPalace#1051, MemPalace#1001,
MemPalace#945.

Branch name intentionally outside the `release/*` ruleset so follow-up
CI-fix commits aren't gated behind a nested PR. (Supersedes MemPalace#1143 —
closed for exactly that reason after it missed 3 of 5 version files.)

Smoke-tested locally from a fresh develop clone:
  grep mempalace-mcp pyproject.toml .claude-plugin/plugin.json   # both ✓
  python -m build --wheel                                        # ✓
  pip install …-py3-none-any.whl                                 # ✓
  which mempalace-mcp                                            # ✓
  mempalace-mcp --help                                           # ✓